06. Edge Case: Character Encodings

In this section, you will learn about Unicode encodings (like UTF-8 and UTF-16) and how they are used.

Why Do We Need Character Encodings?

When you open a text file on your PC, you see readable text. To a computer, though, that file contains only binary data (1s and 0s). All files are stored as bits. By the way, "bit" stands for "binary digit," but it's more common to talk about bytes; a byte is equal to 8 bits.

Thankfully, computers can translate that binary data into readable text. Character sets enumerate all possible characters that can be represented by an encoding. Unicode is the most common character set, and it can represent 143,859 characters and symbols in many different languages. There's also a character set called ASCII, which can only represent characters that are common in the English language.
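For example, the short program below (the class name is just for illustration, not part of the course code) prints the number that both the Unicode and ASCII character sets assign to the letter 'A', along with that number written out in binary digits:

public class CharacterSetExample {
    public static void main(String[] args) {
        char letter = 'A';
        // Both ASCII and Unicode assign the number 65 to the letter 'A'.
        System.out.println((int) letter);                    // prints 65
        // The same number written out as binary digits (bits).
        System.out.println(Integer.toBinaryString(letter));  // prints 1000001
    }
}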

What Are Character Encodings?

A character encoding is a way to convert between binary data and human-readable text characters in a character set.

You saw in the previous section that Readers and Writers use standard character encodings to convert between bytes and text.
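As a minimal sketch of that conversion (this class is not part of the demo), you can encode a String to bytes and decode it back by naming an explicit character encoding:

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EncodeDecodeExample {
    public static void main(String[] args) {
        String text = "hello";

        // Encoding: human-readable text -> binary data (bytes).
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.toString(bytes));  // prints [104, 101, 108, 108, 111]

        // Decoding: binary data (bytes) -> human-readable text.
        System.out.println(new String(bytes, StandardCharsets.UTF_8));  // prints hello
    }
}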

Why do we need to specify character encodings when reading and writing files?

SOLUTION: The files only contain binary data (0s and 1s), and programs need to know how to translate that into human-readable text.

Different Unicode Encodings

UTF-8

Generally speaking, you should use UTF-8. Most HTML documents use this encoding.

It uses at least 8 bits of data to store each character. This can lead to more efficient storage, especially when the text contains mostly English ASCII characters, which each fit in a single byte. But higher-order characters, such as non-ASCII characters, may require 2, 3, or even 4 bytes (up to 32 bits) each!
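To see those sizes yourself, a small sketch like the one below (not from the demo) prints how many bytes UTF-8 uses for different kinds of characters:

import java.nio.charset.StandardCharsets;

public class Utf8Sizes {
    public static void main(String[] args) {
        // ASCII characters need a single byte in UTF-8.
        System.out.println("a".getBytes(StandardCharsets.UTF_8).length);   // prints 1
        // Many accented Latin characters need two bytes.
        System.out.println("é".getBytes(StandardCharsets.UTF_8).length);   // prints 2
        // The euro sign needs three bytes.
        System.out.println("€".getBytes(StandardCharsets.UTF_8).length);   // prints 3
        // Emoji and other supplementary characters need four bytes.
        System.out.println("😀".getBytes(StandardCharsets.UTF_8).length);  // prints 4
    }
}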

UTF-16

This encoding uses at least 16 bits to encode characters, including lower-order ASCII characters and higher-order non-ASCII characters.

If you are encoding text consisting of mostly non-English or non-ASCII characters, UTF-16 may result in a smaller file size. But if you use UTF-16 to encode mostly ASCII text, it will use up more space.
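One way to see this trade-off (again, a sketch rather than course code) is to encode an ASCII string and a non-ASCII string with both encodings and compare the byte counts:

import java.nio.charset.StandardCharsets;

public class CompareEncodings {
    public static void main(String[] args) {
        String ascii = "hello, world";  // only ASCII characters
        String japanese = "こんにちは";    // only non-ASCII characters

        // For ASCII text, UTF-8 uses one byte per character; UTF-16 uses two, plus a 2-byte byte order mark.
        System.out.println(ascii.getBytes(StandardCharsets.UTF_8).length);      // prints 12
        System.out.println(ascii.getBytes(StandardCharsets.UTF_16).length);     // prints 26

        // For this Japanese text, UTF-8 needs three bytes per character; UTF-16 still needs only two.
        System.out.println(japanese.getBytes(StandardCharsets.UTF_8).length);   // prints 15
        System.out.println(japanese.getBytes(StandardCharsets.UTF_16).length);  // prints 12
    }
}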

Code from the Demo

import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class Encode {
    public static void main(String[] args) throws IOException {
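        // Writes "hello, world" encoded as UTF-8; each of these ASCII characters takes a single byte.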
        try (Writer writer = Files.newBufferedWriter(Path.of("L2-demo3-encodings/test_utf8.txt"),
                StandardCharsets.UTF_8)) {
            writer.write("hello, world");
        }

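        // Writes the same text encoded as UTF-16; each character takes two bytes, plus a leading byte order mark.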
        try (Writer writer = Files.newBufferedWriter(Path.of("L2-demo3-encodings/test_utf16.txt"),
                StandardCharsets.UTF_16)) {
            writer.write("hello, world");
        }
    }
}

In the demo, you saw how UTF-8 resulted in a smaller file size than UTF-16 when the text contained all ASCII (English) characters.
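If you want to reproduce that comparison, one option (assuming the two files written by the demo above already exist, and using an illustrative class name) is to print their sizes:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class CompareFileSizes {
    public static void main(String[] args) throws IOException {
        // These paths assume the files created by the Encode demo above.
        System.out.println(Files.size(Path.of("L2-demo3-encodings/test_utf8.txt")));   // 12 bytes
        System.out.println(Files.size(Path.of("L2-demo3-encodings/test_utf16.txt")));  // 26 bytes, including the byte order mark
    }
}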

QUIZ QUESTION::

Match each encoding with the corresponding description:

ANSWER CHOICES:

Encodings: UTF-8, UTF-16, UTF-7, UTF-32

SOLUTION:

UTF-32: Uses 32 bits (4 bytes) for every character.
UTF-8: Uses at least 8 bits (1 byte) per character; ASCII characters need only a single byte.
UTF-16: Uses at least 16 bits (2 bytes) per character.
UTF-7: Represents Unicode text using only 7-bit ASCII characters; now obsolete.

Further Reading